init main repo structure and demonstrate the AR + DiT demo for omni models #6

hsliuustc0106 merged 20 commits into main
Conversation
- Add comprehensive PRD, architecture design, and test design documents
- Implement core modules: OmniLLM, AsyncOmniLLM, stage configurations
- Add DiT scheduler and cache manager for diffusion models
- Implement CLI integration with --omni flag support
- Add API server and plugin system for vLLM integration
- Create comprehensive test suite with fixtures
- Update dependencies to vLLM 0.10.2 and PyTorch 2.8.0
- Add conda environment setup and package installation
- Implement stage-based processing architecture
- Add multimodal output processing capabilities

This commit establishes the foundation for multi-modality model inference and serving with non-autoregressive structures.
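To make the stage-based architecture concrete, here is a minimal sketch of how a two-stage AR→DiT pipeline could be described in configuration. The `StageSketch` dataclass and its field names are illustrative assumptions for this write-up, not the repository's actual `OmniStageConfig` API; the Qwen3-0.6B model name comes from the tests later in this PR, and the diffusion model name is a placeholder.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class StageSketch:
    """Illustrative stage description (hypothetical fields, not the real OmniStageConfig)."""
    name: str
    engine_type: str                 # "ar" for autoregressive, "dit" for diffusion
    model: str
    input_modalities: List[str] = field(default_factory=lambda: ["text"])
    output_modalities: List[str] = field(default_factory=lambda: ["text"])

# The AR stage produces text (or conditioning) that the DiT stage turns into an image.
pipeline = [
    StageSketch(name="ar", engine_type="ar", model="Qwen/Qwen3-0.6B"),
    StageSketch(name="dit", engine_type="dit", model="some/diffusion-model",
                input_modalities=["text"], output_modalities=["image"]),
]
```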
Code Review
This pull request initializes the repository structure for vllm-omni, a multi-modal extension for vLLM. It includes extensive documentation covering product requirements, architecture, and testing design, along with skeleton code for the core components. The overall structure is well-thought-out and aligns with the project's goals.
My review focuses on identifying potential issues in the initial implementation. I've found a critical import error that will break the code, some incorrect logic in the cache manager and scheduler, and several typos in the documentation. I've also noted a dependency on a non-existent PyTorch version which will cause installation failures. Addressing these points will help build a more robust foundation for the project.
```python
def _create_seq_group_from_request(self, request: Dict[str, Any]) -> Any:
    """Create a sequence group from a DiT request."""
    # This would create a proper sequence group
    # For now, we'll return a mock implementation
    from vllm.v1.core.sched.sequence import SequenceGroup

    # Mock sequence group creation
    # In practice, this would properly create a SequenceGroup
    # with the appropriate metadata for DiT processing
    return None
```
The _create_seq_group_from_request method currently returns None. This will cause a TypeError when the return value is used, for example, when it's appended to scheduled_seq_groups and then iterated over. This method should return a valid SequenceGroup object or a placeholder that doesn't break downstream logic.
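One way to avoid the immediate TypeError, sketched below, is to return a lightweight placeholder object instead of None until a real SequenceGroup can be constructed. The `PlaceholderSeqGroup` name and its fields are hypothetical and only illustrate the "valid object or safe placeholder" suggestion above; they are not part of vLLM or this PR.

```python
from dataclasses import dataclass, field
from typing import Any, Dict

@dataclass
class PlaceholderSeqGroup:
    """Hypothetical stand-in until a real SequenceGroup is built for DiT requests."""
    request_id: str
    metadata: Dict[str, Any] = field(default_factory=dict)

# Drop-in sketch of the method above (shown outside its class for brevity).
def _create_seq_group_from_request(self, request: Dict[str, Any]) -> PlaceholderSeqGroup:
    """Create a (placeholder) sequence group from a DiT request."""
    # Returning a real object keeps downstream code that appends the result to
    # scheduled_seq_groups and later iterates over it from raising a TypeError.
    return PlaceholderSeqGroup(request_id=request.get("request_id", ""),
                               metadata=dict(request))
```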
```python
prompt_str, engine_request, tokenization_kwargs = self._process_stage_inputs(stage_config, **stage_args)

# Add inputs to Engine
stage_engine.add_request(request_id, prompt_str, tokenization_kwargs)
```
```python
response_outputs = []
for output in outputs:
    if hasattr(output, 'outputs') and output.outputs:
        for out in output.outputs:
            response_outputs.append({
                "text": getattr(out, 'text', ''),
                "finished": getattr(out, 'finish_reason', 'length') != 'length',
                "tokens": getattr(out, 'token_ids', [])
            })
    else:
        response_outputs.append({
            "text": "",
            "finished": True,
            "tokens": []
        })
```
The response generation logic in the /generate endpoint seems to only handle text-based outputs. It extracts text, finish_reason, and token_ids from the RequestOutput. This is inconsistent with the project's goal of supporting multimodal outputs (like images), and the MultimodalOutputProcessor which is designed to produce outputs with image or latent data. The response model and logic should be updated to handle and serialize multimodal outputs correctly.
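A minimal sketch of what a modality-aware serializer could look like is shown below. The attribute names (`text`, `image`, `latent`) are assumptions about what the MultimodalOutputProcessor might attach to an output, not the project's confirmed schema; the point is only that image and latent payloads need an explicit JSON-safe encoding (base64 here) rather than being dropped.

```python
import base64
import io
from typing import Any, Dict

def serialize_output(out: Any) -> Dict[str, Any]:
    """Sketch of a JSON-safe serializer for mixed text/image/latent outputs."""
    entry: Dict[str, Any] = {
        "text": getattr(out, "text", ""),
        "tokens": list(getattr(out, "token_ids", []) or []),
        "finished": getattr(out, "finish_reason", None) is not None,
    }
    image = getattr(out, "image", None)
    if image is not None:  # e.g. a PIL.Image produced by a DiT stage
        buf = io.BytesIO()
        image.save(buf, format="PNG")
        entry["image_base64"] = base64.b64encode(buf.getvalue()).decode("ascii")
    latent = getattr(out, "latent", None)
    if latent is not None:  # e.g. a torch.Tensor; ship shape plus raw bytes
        entry["latent_shape"] = list(latent.shape)
        entry["latent_base64"] = base64.b64encode(
            latent.cpu().numpy().tobytes()).decode("ascii")
    return entry
```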
- Improve API server with better error handling and response formatting
- Enhance CLI with additional options for DiT stages and configuration
- Add comprehensive examples in examples/basic/ including:
  - API client with health checks and text generation
  - Docker setup and usage examples
  - Simple usage patterns for different scenarios
- Add utility scripts for model downloading and Docker setup
- Update documentation with implementation details and testing guidelines
- Fix configuration validation issues in OmniLLM
- Improve stage configuration handling for AR and DiT stages
- Add proper error handling and fallback mechanisms

Tested with Qwen3-0.6B model:
- Server starts successfully on port 8000
- Health and info endpoints working correctly
- Text generation with various parameters functioning
- API client examples working as expected
- CLI help and configuration options working properly
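As a reference for the testing described above, here is a small client sketch against the server's /health and /generate endpoints. The endpoint paths and port 8000 come from this PR; the JSON field names in the request body (prompt, max_tokens, temperature) are assumptions about the request schema, not the confirmed API.

```python
import requests

BASE = "http://localhost:8000"  # the PR's examples run the server on port 8000

# Health check before sending work.
print(requests.get(f"{BASE}/health", timeout=5).json())

# Text generation; payload field names below are assumed, not confirmed.
resp = requests.post(
    f"{BASE}/generate",
    json={"prompt": "Describe vLLM-omni in one sentence.",
          "max_tokens": 64,
          "temperature": 0.7},
    timeout=60,
)
resp.raise_for_status()
print(resp.json())
```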
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
- Move vllm_omni/core/omni_llm.py to vllm_omni/entrypoints/omni_llm.py
- Update all import statements across the codebase to reflect new location
- Fix relative imports within the moved file
- Maintain functionality while improving code organization
- All imports and functionality tested and working correctly

This change better reflects that OmniLLM and AsyncOmniLLM are the main entry points for vLLM-omni functionality, rather than core implementation details.
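For reference, a minimal sketch of the import after the move, assuming both classes are exported from the relocated module:

```python
# After the move, the entry-point classes live under entrypoints rather than core.
# Old location (removed): vllm_omni/core/omni_llm.py
from vllm_omni.entrypoints.omni_llm import OmniLLM, AsyncOmniLLM
```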
- Add test_serving.sh: Full-featured testing suite with comprehensive validation
- Add quick_test.sh: Fast validation script for quick testing after changes
- Add scripts/README.md: Complete documentation for testing scripts
- Include health checks, text generation, performance testing, and API integration
- Add retry mechanisms and proper error handling
- Support for different models and ports
- Comprehensive logging and colored output
- Ready for CI/CD integration

Usage:
- Quick test: ./scripts/quick_test.sh [port]
- Full test: ./scripts/test_serving.sh [model_path] [port]
```python
    input_modalities=["text"],
    output_modalities=["text"]
)
stage_configs.append(ar_config)
```
```python
    prompt_logprobs=None,
    outputs=[mock_output],
    finished=True
)
```
Bug: Async Class Calls Sync Method
The AsyncOmniLLM class isn't fully asynchronous. It inherits from LLM and its generate_async method calls the synchronous super().generate(), which blocks the event loop. Additionally, within _execute_stage_async, DiffusersPipelineEngine is initialized with parameters that don't align with its constructor signature.
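A common way to keep the event loop responsive while the underlying generate() remains synchronous is to push the blocking call onto a worker thread. The sketch below is only an illustration of that idea under assumed method names; it is not the repository's AsyncOmniLLM implementation, which inherits from vLLM's LLM.

```python
import asyncio
from typing import Any, List

class AsyncOmniLLMSketch:
    """Illustrative wrapper only; the real AsyncOmniLLM inherits from vLLM's LLM."""

    def generate(self, prompts: List[str], **kwargs: Any) -> List[Any]:
        # Stand-in for the inherited, blocking LLM.generate().
        raise NotImplementedError

    async def generate_async(self, prompts: List[str], **kwargs: Any) -> List[Any]:
        # asyncio.to_thread (Python 3.9+) runs the synchronous call in a worker
        # thread, so awaiting it does not block the event loop; positional and
        # keyword arguments are forwarded to the wrapped callable.
        return await asyncio.to_thread(self.generate, prompts, **kwargs)
```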
correct _thinker_to_talker_prefill to handle multiple segments inside one chunk
P0 fixes:
- vllm-project#1: _free_scaffold_weights now shrinks storage to zero (actually releases VRAM). Only runs when SKIP_SCAFFOLD is also set. Called lazily after first prefill, not at load time.
- vllm-project#2: Sliding VAE default OFF (splice algorithm had alignment bug). _sliding_vae_decode now falls back to full decode until proper overlap-add is implemented.
- vllm-project#3: Complete per-request state reset in preprocess: now clears _curr_prefix_feat_cond, _last_audio_patch_gpu, _prev_audio, _prev_audio_len, _decode_step_count, _precomputed_stop_logits.
- vllm-project#4: compute_logits fallback forces stop (not continue) when _prefill_completed=True, preventing runaway generation.
- vllm-project#5: Scaffold VRAM: load_weights no longer frees immediately; _free_scaffold_weights called after first prefill completes, so scaffold is available for prefill then released.

P1 fixes:
- vllm-project#6: Log all active config flags at load time.
- vllm-project#7: Remove dead _STOP_CHECK_INTERVAL code.
- vllm-project#8: Remove broken audio_duration formula from postprocess.
- vllm-project#9/vllm-project#14: Move `from einops import rearrange` to module top level.
- vllm-project#11: Remove torch.no_grad() context from _forward_decode_graphable (incompatible with CUDA Graph capture).
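For the first P0 item, "shrinks storage to zero" typically means pointing each parameter at an empty tensor so the allocator can actually return the VRAM. The helper below is a generic sketch of that technique, not the repository's _free_scaffold_weights; the function name and module argument are hypothetical.

```python
import torch
from torch import nn

def free_module_weights(module: nn.Module) -> None:
    """Generic sketch: release a module's parameter storage to reclaim VRAM."""
    with torch.no_grad():
        for param in module.parameters():
            # Point the parameter at a zero-element tensor so its old storage
            # (and the VRAM behind it) can be freed by the caching allocator.
            param.data = torch.empty(0, dtype=param.dtype, device=param.device)
    # Ask the allocator to hand cached blocks back to the driver.
    torch.cuda.empty_cache()
```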
vllm serve --omni for 1) AR models only and 2) AR + DiT models

Test Plan:
We test the following scenarios:
Test Results:
==========================================
Note
Introduce vLLM-omni multi-stage (AR→DiT) pipeline with CLI vllm --omni, FastAPI server, diffusers-backed diffusion, configs/output processing, examples, tests, and scripts; update dependencies.
- OmniLLM/AsyncOmniLLM, StageManager, OmniRequest, and implementation docs.
- OmniStageConfig, DiTConfig, DiTCacheConfig (+ helpers); remove legacy dit_cache_interface.
- Output processing (engine/output_processor.py).
- Diffusion engine and workers: engine/diffusion_engine.py, worker/gpu_diffusion_model_runner.py, worker/gpu_diffusion_worker.py.
- DiT cache manager (core/dit_cache_manager.py).
- vllm serve --omni (entrypoints/cli/*, pyproject scripts).
- /generate, /health, /info (entrypoints/api_server.py).
- Examples (examples/*).
- Serving test script (scripts/test_serving.sh).
- Tests (tests/*).
- Dependencies: vllm>=0.10.2, torch>=2.7; add PyYAML; expose new CLI scripts in pyproject.toml and expand requirements.txt dev tools.

Written by Cursor Bugbot for commit 05d2367. This will update automatically on new commits.